80 research outputs found
Explicit Reasoning over End-to-End Neural Architectures for Visual Question Answering
Many vision and language tasks require commonsense reasoning beyond
data-driven image and natural language processing. Here we adopt Visual
Question Answering (VQA) as an example task, where a system is expected to
answer a question in natural language about an image. Current state-of-the-art
systems attempted to solve the task using deep neural architectures and
achieved promising performance. However, the resulting systems are generally
opaque and they struggle in understanding questions for which extra knowledge
is required. In this paper, we present an explicit reasoning layer on top of a
set of penultimate neural network based systems. The reasoning layer enables
reasoning and answering questions where additional knowledge is required, and
at the same time provides an interpretable interface to the end users.
Specifically, the reasoning layer adopts a Probabilistic Soft Logic (PSL) based
engine to reason over a basket of inputs: visual relations, the semantic parse
of the question, and background ontological knowledge from word2vec and
ConceptNet. Experimental analysis of the answers and the key evidential
predicates generated on the VQA dataset validate our approach.Comment: 9 pages, 3 figures, AAAI 201
What Can I Do Around Here? Deep Functional Scene Understanding for Cognitive Robots
For robots that have the capability to interact with the physical environment
through their end effectors, understanding the surrounding scenes is not merely
a task of image classification or object recognition. To perform actual tasks,
it is critical for the robot to have a functional understanding of the visual
scene. Here, we address the problem of localizing and recognition of functional
areas from an arbitrary indoor scene, formulated as a two-stage deep learning
based detection pipeline. A new scene functionality testing-bed, which is
complied from two publicly available indoor scene datasets, is used for
evaluation. Our method is evaluated quantitatively on the new dataset,
demonstrating the ability to perform efficient recognition of functional areas
from arbitrary indoor scenes. We also demonstrate that our detection model can
be generalized onto novel indoor scenes by cross validating it with the images
from two different datasets
- …